Part I Basic Analysis

1. Introduction

Have you ever read horror stories when you were young?
Have you ever been scared of horror fiction?
No matter the answer is yes or no, let us make an analysis of the horror fiction. We will analyze texts from Edgar Allan Poe, Mary Shelley, and HP Lovecraft with the goal to find out whether they are appropriate for children to read.

2. Preparations

2.1 Load Libraries

We will take a range of libraries for general data wrangling and general visualisation together with more specialised language-processing tools.
packages.used <- c("ggplot2", "dplyr", "tibble", "tidyr",  "stringr", "tidytext", "topicmodels", "wordcloud2", "wordcloud", "ggridges","corrplot")

# check packages that need to be installed.
packages.needed <- setdiff(packages.used, intersect(installed.packages()[,1], packages.used))

# install additional packages
if(length(packages.needed) > 0) {
  install.packages(packages.needed, dependencies = TRUE, repos = 'http://cran.us.r-project.org')
}

library(ggplot2)
library(dplyr)
library(tibble)
library(tidyr)
library(stringr)
library(tidytext)
library(topicmodels)
library(wordcloud2)
library(wordcloud)
library(ggridges)
library(corrplot)

source("../libs/multiplot.R")

2.2 Load Data

spooky <- read.csv('../data/spooky.csv', as.is = TRUE)

3. An overview of the data structure and content

head(spooky)
summary(spooky)
##       id                text              author         
##  Length:19579       Length:19579       Length:19579      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
# View(spooky)
we found each sentence is assigned an ID. So we checked whether the ID and text of each row was unique
length(unique(spooky$id)) == nrow(spooky)
## [1] TRUE
length(unique(spooky$text)) == nrow(spooky)
## [1] TRUE
Any missing value?
sum(is.na(spooky))
## [1] 0
Change author name to be a factor variable
spooky$author <- as.factor(spooky$author)

4. The Tidy text format

Using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text. Therefore, we first use the unnest_tokens() function to drop all punctuation and transform all words into lower case.
# Make a table with one word per row and remove `stop words` (i.e. the common words).
spooky_wrd <- unnest_tokens(spooky, word, text)
spooky_wrd <- anti_join(spooky_wrd, stop_words, by = "word")

5. Word Frequency

5.1 Wordcloud

Now we use package “Wordcloud2” to generate some wordclouds to visualise each author’s work. But these plots can only help us have an overview of author’s work. Then, we will use other methods to see the word frequency in each author’s work in the following parts of this charpter.

5.1.1 Wordcloud for EAP

spooky_wrd_EAP <- spooky_wrd%>%
  dplyr::count(author,word)%>%
  dplyr::group_by(author)%>%
  tidyr::spread(author, n)%>%
  dplyr::select(word, EAP) %>%
  dplyr::filter(!is.na(EAP))%>%
  dplyr::arrange(desc(EAP))


figPath = system.file("examples/octopus.jpg",package = "wordcloud2")
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/Wordcloud_EAP")
wordcloud2(spooky_wrd_EAP, figPath = figPath, size = 0.5, color = "random-light", backgroundColor = "black")

5.1.2 Wordcloud for MWS

spooky_wrd_MWS <- spooky_wrd%>%
  dplyr::count(author,word)%>%
  dplyr::group_by(author)%>%
  tidyr::spread(author, n)%>%
  dplyr::select(word, MWS) %>%
  dplyr::filter(!is.na(MWS))%>%
  dplyr::arrange(desc(MWS))


figPath = system.file("examples/white-ghost-hi.jpg",package = "wordcloud2")
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/Wordcloud_MWS")
wordcloud2(spooky_wrd_MWS, figPath = figPath, size = 0.5, color = "random-light", backgroundColor = "black")

5.1.3 Wordcloud for HPL

spooky_wrd_HPL <- spooky_wrd%>%
  dplyr::count(author,word)%>%
  dplyr::group_by(author)%>%
  tidyr::spread(author, n)%>%
  dplyr::select(word, HPL) %>%
  dplyr::filter(!is.na(HPL))%>%
  dplyr::arrange(desc(HPL))


figPath = system.file("examples/black-bird-hi.png",package = "wordcloud2")
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/Wordcloud_HPL")
wordcloud2(spooky_wrd_HPL, figPath = figPath, size = 0.5, color = "random-light", backgroundColor = "black")

5.2 Word frequency of each author’s work

In this part, we could have a clear thought of the word frequency of each author’s work. We can find out which words they used most frequently.
# Counts number of times each author used each word.
author_words <- count(group_by(spooky_wrd, word, author))

# Counts number of times each word was used.
all_words    <- rename(count(group_by(spooky_wrd, word)), all = n)

author_words <- left_join(author_words, all_words, by = "word")
author_words <- arrange(author_words, desc(all))
author_words <- ungroup(head(author_words, 81))

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/word_frequency_of_each_author")
ggplot(author_words) +
  geom_col(aes(reorder(word, all, FUN = min), n, fill = author)) +
  xlab(NULL) +
  coord_flip() +
  facet_wrap(~ author) +
  theme(legend.position = "none")

# dev.off()
We can already see that some words are almost equally frequent for all authors, such as “time”. In contrast, “love” is clearly more used by Shelley than by Lovecraft. The word “half” is only found in Poe and Lovecraft, but not in Shelley’s work at a notable frequency.

5.3 Author-dependent word frequencies

Lets start to plot the word frequencies (log scale) comparing two authors at a time and see how words distribute on the plane. Words that are close to the line (y = x) have similar frequencies in both sets of texts. While words that are far from the line are words that are found more in one set of texts than another.
As we can see in the plots below, there are some words close to the line but most of the words are around the line showing a difference between the frequencies.

5.3.1 EAP vs. HPL

#We need to spread the author (key) and the word frequency (value) across multiple columns. There might be NAs if it was not uesd by an author
tib.word_freqs_author <- spooky_wrd %>%
  dplyr::count(author, word) %>%
  dplyr::group_by(author) %>%
  dplyr::mutate(word_freq_author = n/sum(n)) %>%
  dplyr::select(-n)

tib.word_freqs_author <- tib.word_freqs_author%>%
  tidyr::spread(author, word_freq_author)

#Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)

author_words_EAP_vs_HPL <- tib.word_freqs_author %>%
  dplyr::select(word, EAP, HPL) %>%
  dplyr::filter(!is.na(EAP) & !is.na(HPL))

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/word_frequency_EAPvsHPL")
ggplot(data = author_words_EAP_vs_HPL, mapping = aes(x = EAP, y = HPL, color = abs(EAP - HPL))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "HP Lovecraft", x = "Edgard Allan Poe")

5.3.2 EAP vs. MWS

#Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)

author_words_EAP_vs_MWS <- tib.word_freqs_author %>%
  dplyr::select(word, EAP, MWS) %>%
  dplyr::filter(!is.na(EAP) & !is.na(MWS))

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/word_frequency_EAPvsMWS")
ggplot(data = author_words_EAP_vs_MWS, mapping = aes(x = EAP, y = MWS, color = abs(EAP - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "Edgard Allan Poe")

5.3.3 HPL vs. MWS

author_words_HPL_vs_MWS <- tib.word_freqs_author %>%
  dplyr::select(word, HPL, MWS) %>%
  dplyr::filter(!is.na(HPL) & !is.na(MWS))

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/word_frequency_HPLvsMWS")
ggplot(data = author_words_HPL_vs_MWS, mapping = aes(x = HPL, y = MWS, color = abs(HPL - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "HP Lovecraft")

5.4 Correlation – Pearson

Next, we will use Pearson for linearity method to calculate a correlation between the authors. In this way, we can verify the similarity or differences these sets of word frequencies by author.
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/word_frequency_correlation")
cor.tib.word_freqs_author <- tib.word_freqs_author[2:4]%>%
  cor(use="complete.obs", method="spearman") %>%
  corrplot(type="lower",
           method="pie",
           diag = F)

There is a correlation of around 0.48 to 0.5 between the different authors.

5.5 TF-IDF

TF stands for term frequency or how often a word appears in a text and it is what is studied above in the word cloud. IDF stands for inverse document frequncy, and it is a way to pay more attention to words that are rare within the entire set of text data that is more sophisticated than simply removing stop words. Multiplying these two values together calculates a term’s tf-idf, which is the frequency of a term adjusted for how rarely it is used. We’ll use tf-idf as a heuristic index to indicate how frequently a certain author uses a word relative to the frequency that ll the authors use the word. Therefore we will find words that are characteristic for a specific author, a good thing to have if we are interested in solving the author identification problem.
frequency <- count(spooky_wrd, author, word)
tf_idf    <- bind_tf_idf(frequency, word, author, n)
head(tf_idf)
tail(tf_idf)
tf_idf    <- arrange(tf_idf, desc(tf_idf))
tf_idf    <- mutate(tf_idf, word = factor(word, levels = rev(unique(word))))

# Grab the top thirty tf_idf scores in all the words 
tf_idf_30 <- top_n(tf_idf, 30, tf_idf)

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/tf_idf")
ggplot(tf_idf_30) +
  geom_col(aes(word, tf_idf, fill = author)) +
  labs(x = NULL, y = "TF-IDF values") +
  theme(legend.position = "top", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9))

Note that in the above, many of the words recognized by their tf-idf scores are names. This makes sense – if we see text referencing Raymond, Idris, or Perdita, we know almost for sure that MWS is the author. But some non-names stand out. EAP often uses “monsieur” and “jupiter” while HPL uses the words “bearded” and “attic” more frequently than the others. We can also look at the most characteristic terms per author.
# Grab the top twenty tf_idf scores in all the words for each author
tf_idf <- ungroup(top_n(group_by(tf_idf, author), 20, tf_idf))

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/tf_idf_author")
ggplot(tf_idf) +
  geom_col(aes(word, tf_idf, fill = author)) +
  labs(x = NULL, y = "tf-idf") +
  theme(legend.position = "none") +
  facet_wrap(~ author, ncol = 3, scales = "free") +
  coord_flip() +
  labs(y = "TF-IDF values")

6. Data Visualization

We’ll do some simple numerical summaries of the data to provide some nice visualizations.
p1 <- ggplot(spooky) +
      geom_bar(aes(author, fill = author)) +
      theme(legend.position = "none")


spooky$sen_length <- str_length(spooky$text)
head(spooky$sen_length)
## [1] 231  71 200 206 174 468
p2 <- ggplot(spooky) +
      geom_density_ridges(aes(sen_length, author, fill = author)) +
      scale_x_log10() +
      theme(legend.position = "none") +
      labs(x = "Sentence length [# characters]")


spooky_wrd$word_length <- str_length(spooky_wrd$word)
head(spooky_wrd$word_length)
## [1]  7  8  5 12 10  7
p3 <- ggplot(spooky_wrd) +
      geom_density(aes(word_length, fill = author), bw = 0.05, alpha = 0.3) +
      scale_x_log10() +
      theme(legend.position = "none") +
      labs(x = "Word length [# characters]")

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/data_visualization")
layout <- matrix(c(1, 2, 1, 3), 2, 2, byrow = TRUE)
multiplot(p1, p2, p3, layout = layout)
## Loading required package: grid
## Picking joint bandwidth of 0.0414

Part II Further analysis – are they appropriate for children to read?

The Motion Picture Association of America (MPAA) film rating system is used in the United States and its territories to rate a film’s suitability for certain audiences based on its content. The MPAA rating system is one of various motion picture rating systems that are used to help parents decide what films are appropriate for their children.
We would like to apply the idea of movie rating to book classification to help parents and children choose age-appropriate books.

1. Overview of Sentiment Analysis

When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. In this part, we will do some basic sentiment analysis.

1.1 Use NRC lexicon

get_sentiments('nrc')
sentiments <- inner_join(spooky_wrd, get_sentiments('nrc'), by = "word")

count(sentiments, sentiment)
count(sentiments, author, sentiment)
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/NRC_all_bar")
ggplot(count(sentiments, sentiment)) + 
  geom_col(aes(sentiment, n, fill = sentiment))

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/NRC_author")
ggplot(count(sentiments, author, sentiment)) + 
  geom_col(aes(sentiment, n, fill = sentiment)) + 
  facet_wrap(~ author) +
  coord_flip() +
  theme(legend.position = "none")

From above plot, we found that not much of a positive mood in HPL’s works. MWS used positive words almost as much as negative ones. There’s much “trust” and “joy” to counteract all the “sadness”, “fear”, and “anger” in the world.
Then, we will seperate positive and negative words of each author to see more details.
  1. EAP
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/NRC_EAP")
sentiments_EAP <- sentiments%>%
  filter(author == "EAP" & (sentiment == "positive" | sentiment == "negative")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  group_by(sentiments_EAP,sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative/positive sentiment", x = NULL) +
  coord_flip() +
  ggtitle("Edgar Allan Poe - Sentiment analysis")

It is interesting that his most-used negative words are “words” and “doubt” - pointing to a different kind of horror of the imagination. His most-uesd positive word is “found”.This made me much more interested in his works. I can’t help wondering what I can find in his work.
  1. HPL
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/NRC_HPL")
sentiments_HPL <- sentiments%>%
  filter(author == "HPL" & (sentiment == "positive" | sentiment == "negative")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  group_by(sentiments_HPL,sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative/positive sentiment", x = NULL) +
  coord_flip() +
  ggtitle("HP Lovecraft - Sentiment analysis")

We found that Mary Shelley used positive words almost as much as negative ones. She used “love” a lot as a positive word. “Death” is clearly her most common negative word.
We have a question now. Will the result be different if we use other lexicons?
Let’s change from “nrc” to “bing” to get more information about sentiment analysis.

1.2 Use Bing Lexicon

get_sentiments('bing')
sentiments_bing <- inner_join(spooky_wrd, get_sentiments('bing'), by = "word")

count(sentiments_bing, sentiment)
count(sentiments_bing, author, sentiment)
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/bing_all_bar")
ggplot(count(sentiments_bing, sentiment)) + 
  geom_col(aes(sentiment, n, fill = sentiment))

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/bing_author")
ggplot(count(sentiments_bing, author, sentiment)) + 
  geom_col(aes(sentiment, n, fill = sentiment)) + 
  facet_wrap(~ author) +
  coord_flip() +
  theme(legend.position = "none")

After applying “bing” lexicon, we found that overall, these three authors used more negative words than positive words.
Then, we will also seperate positive and negative words of each author to see more details.
  1. EAP
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/bing_EAP")
sentiments_bing_EAP <- sentiments_bing%>%
  filter(author == "EAP") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  group_by(sentiments_bing_EAP,sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative/positive sentiment", x = NULL) +
  coord_flip() +
  ggtitle("Edgar Allan Poe - Sentiment analysis")

“doubt” and “death” became the most uesd negative words in EAP’s works. While the word “word” which is the most used negative word in NRC lexicon analysis is no loner existing. The words in positive part are also different from NRC lexicon analysis.
  1. HPL
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/bing_HPL")
sentiments_bing_HPL <- sentiments_bing%>%
  filter(author == "HPL") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  group_by(sentiments_bing_HPL,sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative/positive sentiment", x = NULL) +
  coord_flip() +
  ggtitle("HP Lovecraft - Sentiment analysis")

There is not much positive mood in HPL’s works. The most uesd negative word and positive word are both different from NRC lexicon analysis. In this result, his most uesd negative word is “strange” which is somehow more reasonable than “ancient” in NRC.
  1. MWS
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/bing_MWS")
sentiments_bing_MWS <- sentiments_bing%>%
  filter(author == "MWS") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  group_by(sentiments_bing_MWS,sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative/positive sentiment", x = NULL) +
  coord_flip() +
  ggtitle("Mary Shelley - Sentiment analysis")

From above plot, we can find that she used “love” a lot as a positive word. “Death” is clearly her most common negative word.To my surprise, the most used negative and positive words are the same with the NRC lexicon analysis. She used positive words almost as much as negative ones.
After comparing the results generated by different lexicons, I decided to use “bing” lexicon to do further analysis. Next, we will only study “negative” words to find some clues about the theroy of book classification.

2. Comparing negativity

bing_neg <- filter(get_sentiments('bing'), sentiment == "negative")
bing_neg
negative <- inner_join(spooky_wrd, bing_neg, by = "word")
head(negative)
count(negative, word, sort = TRUE)
First, we plot a frequency comparison of these “negative” words.
neg_words     <- count(group_by(negative, word, author))
neg_words_all <- count(group_by(negative, word))

neg_words <- left_join(neg_words, neg_words_all, by = "word")
neg_words <- arrange(neg_words, desc(n.y))
neg_words <- ungroup(head(neg_words, 81))

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/Freq_negative")
ggplot(neg_words) +
  geom_col(aes(reorder(word, n.y, FUN = min), n.x, fill = author)) +
  xlab(NULL) +
  coord_flip() +
  facet_wrap(~ author) +
  theme(legend.position = "none")

Then, we go a step further and assign a “negativity fraction” to each sentence; defined in the same way as the other index: # negative / (# negative + # positive). We plot the distribution of these negativity indeces for the three authors:
p1 <- ggplot(sentiments_bing, aes(author, fill = sentiment)) + geom_bar(position = "fill")


p2 <- group_by(sentiments_bing, author, id, sentiment) %>%
  count() %>%
  spread(sentiment, n, fill = 0) %>%
  group_by(author, id) %>%
  summarise(neg = sum(negative),
            pos = sum(positive)) %>%
  arrange(id) %>%
  mutate(frac_neg = neg/(neg + pos)) %>%
  ggplot(aes(frac_neg, fill = author)) +
  geom_density(bw = .2, alpha = 0.3) +
  theme(legend.position = "right") +
  labs(x = "Fraction of negative words per sentence")

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/Freq_negative_fraction")
layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(p1, p2, layout=layout)

HPL’s works are more negative than EAP and MWS. From the above plots, we know clearly when the fraction of negative words per sentence is between 0 and 0.5, the probability of negative words of MWS is lager than EAP and HPL.

3. Violence level analysis

The Motion Picture Association of America (MPAA) film rating system has five components: Violence, Language, Substance abuse, Nudity and Sexual content. Considering the last 4 are more difficult to choose the standard for comparison, so here we only analyze the first one – Violence.
We first built a dictionary contains all the common violent words. The violent words come from http://www.thesaurus.com. For example: crazy, cruel, fierce.
Then, we counted these words appeared in entire dataset and in each author’s works.
#build a dictionary which contains all the common violence words. 
Vio <- data.frame(word=c('brutal','crazy','cruel','fierce','homicidal','hysterical','murderous','passionate','potent','powerful','savage','uncontrollable','vicious','agitated','aroused','berserk','bloodthirsty','coercive','demoniac','desperate','distraught','disturbed','enraged','fiery','forceful','forcible','frantic','fuming','furious','great','headstrong','hotheaded','impassioned','impetuous','inflamed','intemperate','mad','maddened','maniacal','mighty','raging','riotous','rough','strong','ungovernable','unrestrained','urgent','vehement','wild'),stringsAsFactors=FALSE)
head(Vio)
#count these words appeared in spooky data
vio_words <- inner_join(Vio, spooky_wrd, by = "word")
lapply(vio_words, function(x)which(is.na(x)))
## $word
## integer(0)
## 
## $id
## integer(0)
## 
## $author
## integer(0)
## 
## $word_length
## integer(0)
vio_words <- count(group_by(vio_words, word,author))
head(vio_words)
# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/violent_word_all")
ggplot(vio_words) + 
  geom_col(aes(word,n,fill = word))

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/violent_word_author")
ggplot(vio_words) + 
  geom_col(aes(word, n, fill = word)) + 
  facet_wrap(~ author) +
  coord_flip() +
  theme(legend.position = "none")

We found that these violent terms are indeed distributed among the three authors’ works. Due to the lack of further information, we can not set a standard to justify whether there is too much violent mood in their works. Maybe in the future, we will have a baseline, for example, when these violent terms make up about 20% or more of the entire article, children need to read such books under the guidance of their parents.

4. Topic Modeling

Topic modeling is a method for unsupervised classification of documents by themes, similar to clustering on numeric data. We’re trying to look through the content of each author’s works to identify whether they are appropriate for children to read by running Latent Dirichlet Allocation.
# Counts how many times each word appears in each sentence
sent_wrd_freqs <- count(spooky_wrd, id, word)
head(sent_wrd_freqs)
# Creates a DTM matrix
spooky_wrd_tm <- cast_dtm(sent_wrd_freqs, id, word, n)
spooky_wrd_tm
## <<DocumentTermMatrix (documents: 19467, terms: 24941)>>
## Non-/sparse entries: 193944/485332503
## Sparsity           : 100%
## Maximal term length: 19
## Weighting          : term frequency (tf)
The matrix spooky_wrd_tm is a sparse matrix with 19467 rows, corresponding to the 19467 ids (or originally, sentences) in the spooky_wrd dataframe, and 24941 columns corresponding to the total number of unique words in the spooky_wrd dataframe. So each row of spooky_wrd_tm corresponds to one of the original sentences. The value of the matrix at a certain position is then the number of occurences of that word (determined by the column) in this specific sentence (determined by the row). Since most sentence/word pairings don’t occur, the matrix is sparse meaning there are many zeros.
For LDA we must pick the number of possible topics. I tried 12, 10, 8 and 6. Finnally I chose 6.
spooky_wrd_lda <- LDA(spooky_wrd_tm, k = 6, control = list(seed = 1234))
spooky_wrd_topics <- tidy(spooky_wrd_lda, matrix = "beta")
spooky_wrd_topics
spooky_wrd_lda
## A LDA_VEM topic model with 6 topics.
We note that in the above we use the tidy function to extract the per-topic-per-word probabilities, called “beta” or \(\beta\), for the model. The final output has a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic.
# Grab the top five words for each topic.
spooky_wrd_topics_5 <- ungroup(top_n(group_by(spooky_wrd_topics, topic), 5, beta))
spooky_wrd_topics_5 <- arrange(spooky_wrd_topics_5, topic, -beta)
spooky_wrd_topics_5 <- mutate(spooky_wrd_topics_5, term = reorder(term, beta))

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/topic_modeling")
ggplot(spooky_wrd_topics_5) +
  geom_col(aes(term, beta, fill = factor(topic)), show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 3) +
  coord_flip()

Also, we see that these 6 topics are quite similar. I can hardly tell the difference between these topics. The only thing I can be sure of is that these words remind me of all the horror stories I might think of in my life.
Therefore, let’s study terms that have the greatest difference in probabilities between the topics, ignoring the words that are shared with similar frequency between topics. We choose only the first 3 topics as example and visualise the differences by plotting log ratios: \(log_{10}(\beta \text{ of topic x }/ \beta \text{ of topic y})\). So if a word is 10 times more frequent in topic x the log ratio will be 1, whereas it will be -1 if the word is 10 times more frequent in topic y.
spooky_wrd_topics <- mutate(spooky_wrd_topics, topic = paste0("topic", topic))
spooky_wrd_topics <- spread(spooky_wrd_topics, topic, beta)

spooky_wrd_topics_12 <- filter(spooky_wrd_topics, topic1 > .001 | topic2 > .001)
spooky_wrd_topics_12 <- mutate(spooky_wrd_topics_12, log_ratio = log10(topic2 / topic1))
spooky_wrd_topics_12 <- group_by(spooky_wrd_topics_12, direction = log_ratio > 0)
spooky_wrd_topics_12 <- ungroup(top_n(spooky_wrd_topics_12, 5, abs(log_ratio)))
spooky_wrd_topics_12 <- mutate(spooky_wrd_topics_12, term = reorder(term, log_ratio))

p1 <- ggplot(spooky_wrd_topics_12) +
      geom_col(aes(term, log_ratio, fill = log_ratio > 0)) +
      theme(legend.position = "none") +
      labs(y = "Log ratio of beta in topic 2 / topic 1") +
      coord_flip()


spooky_wrd_topics_23 <- filter(spooky_wrd_topics, topic2 > .001 | topic3 > .001)
spooky_wrd_topics_23 <- mutate(spooky_wrd_topics_23, log_ratio = log10(topic3 / topic2))
spooky_wrd_topics_23 <- group_by(spooky_wrd_topics_23, direction = log_ratio > 0)
spooky_wrd_topics_23 <- ungroup(top_n(spooky_wrd_topics_23, 5, abs(log_ratio)))
spooky_wrd_topics_23 <- mutate(spooky_wrd_topics_23, term = reorder(term, log_ratio))

p2 <- ggplot(spooky_wrd_topics_23) +
      geom_col(aes(term, log_ratio, fill = log_ratio > 0)) +
      theme(legend.position = "none") +
      labs(y = "Log ratio of beta in topic 3 / topic 2") +
      coord_flip()

spooky_wrd_topics_13 <- filter(spooky_wrd_topics, topic3 > .001 | topic1 > .001)
spooky_wrd_topics_13 <- mutate(spooky_wrd_topics_13, log_ratio = log10(topic3 / topic1))
spooky_wrd_topics_13 <- group_by(spooky_wrd_topics_13, direction = log_ratio > 0)
spooky_wrd_topics_13 <- ungroup(top_n(spooky_wrd_topics_13, 5, abs(log_ratio)))
spooky_wrd_topics_13 <- mutate(spooky_wrd_topics_13, term = reorder(term, log_ratio))

p3 <- ggplot(spooky_wrd_topics_13) +
      geom_col(aes(term, log_ratio, fill = log_ratio > 0)) +
      theme(legend.position = "none") +
      labs(y = "Log ratio of beta in topic 3 / topic 1") +
      coord_flip()

# jpeg("/Users/yunli/Documents/GitHub/spring2018-project1-YUNLI531/figs/topic_modeling_123")
layout <- matrix(c(1,2,3), 3, 1, byrow = TRUE)
multiplot(p1, p2, p3, layout = layout)

In the above, I guess topic 1 may occur in the wild because this topic contains words like “moon”, “sky” and “air” which are more popular than other two. Topic 2 may happen at a late night party. The story may have taken place in a study room full of books, all of them standing together.

Think further

Due to the lack of appropriate judgments of children, the film classification system has largely avoided the children’s exposure to violence, drugs and other works that are likely to have adverse effects on them. Similarly, in the literary world, similar hierarchies are needed to help children grow healthily.
In this project, I didn’t find a way to set up a standard to justify whether there is too much violent mood in their works. I would like to explore more in this area and hope in the future, we will have a baseline, for example, when these violent terms make up about 20% or more of the entire article, children need to read such books under the guidance of their parents.